Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning
Authors
Abstract
Humans tend to decompose a sentence into different parts like sth do sth at someplace and then fill each part with certain content. Inspired by this, we follow the principle of modular design and propose a novel image captioner: learning to Collocate Visual-Linguistic Neural Modules (CVLNM). Unlike the widely used neural module networks in VQA, where the language (i.e., the question) is fully observable, the task of collocating visual-linguistic modules is more challenging. This is because the language is only partially observable, for which we need to dynamically collocate the modules during the process of captioning. To sum up, we make the following technical contributions to design and train our CVLNM: (1) a distinguishable module design: four modules in the encoder, including one linguistic module for function words and three visual modules for different content words (i.e., noun, adjective, and verb), and another linguistic module in the decoder for commonsense reasoning; (2) a self-attention based module controller for robustifying the visual reasoning; (3) a part-of-speech based syntax loss imposed on the module controller for further regularizing the training of our CVLNM. Extensive experiments on the MS-COCO dataset show that our CVLNM is more effective, e.g., achieving a new state-of-the-art 129.5 CIDEr-D, and more robust, e.g., being less likely to overfit to dataset bias and suffering less when fewer training samples are available. Codes are available at https://github.com/GCYZSL/CVLMN.
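To make the architecture concrete, here is a minimal PyTorch sketch of the self-attention based module controller idea: soft weights over the four encoder modules are read off the partially generated sentence and used to collocate the module outputs. All names, shapes, and hyperparameters below are our own illustrative assumptions, not the authors' released code (see the repository linked above for that).

import torch
import torch.nn as nn

class ModuleController(nn.Module):
    """Illustrative controller: collocates module outputs with soft weights."""
    def __init__(self, d_model: int, n_modules: int = 4, n_heads: int = 8):
        super().__init__()
        # Self-attention over the words emitted so far (the language is only
        # partially observable during captioning).
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.to_weights = nn.Linear(d_model, n_modules)

    def forward(self, word_states, module_feats):
        # word_states:  (B, T, d) hidden states of the words generated so far
        # module_feats: (B, n_modules, d) one output vector per encoder module
        ctx, _ = self.self_attn(word_states, word_states, word_states)
        w = torch.softmax(self.to_weights(ctx[:, -1]), dim=-1)  # (B, n_modules)
        # Collocate: convex combination of the module outputs.
        return torch.einsum('bm,bmd->bd', w, module_feats), w

Under this reading, the part-of-speech syntax loss could be a cross-entropy between the weights w and the POS category (noun, adjective, verb, or function word) of the next ground-truth word, which is one way to regularize which module the controller consults at each step.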
Similar resources
Stack-Captioning: Coarse-to-Fine Learning for Image Captioning
The existing image captioning approaches typically train a one-stage sentence decoder, which makes it difficult to generate rich, fine-grained descriptions. On the other hand, a multi-stage image captioning model is hard to train due to the vanishing gradient problem. In this paper, we propose a coarse-to-fine multi-stage prediction framework for image captioning, composed of multiple decoders, each of which...
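As a rough sketch of the coarse-to-fine idea (our own simplification, not the paper's implementation): stack several decoders, let each stage refine its predecessor's states, and supervise every stage so gradients reach the early ones.

import torch.nn as nn

class CoarseToFineDecoder(nn.Module):
    """Illustrative stack of decoders with per-stage supervision."""
    def __init__(self, d_model: int, vocab_size: int, n_stages: int = 3):
        super().__init__()
        self.stages = nn.ModuleList(
            [nn.LSTM(d_model, d_model, batch_first=True) for _ in range(n_stages)])
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(n_stages)])

    def forward(self, x):
        # x: (B, T, d) image-conditioned input sequence.
        all_logits, h = [], x
        for lstm, head in zip(self.stages, self.heads):
            h, _ = lstm(h)              # each stage refines the previous one
            all_logits.append(head(h))  # intermediate supervision per stage
        return all_logits               # a loss on every stage eases gradients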
Learning to Guide Decoding for Image Captioning
Recently, much progress has been made in image captioning, and the encoder-decoder framework has achieved outstanding performance for this task. In this paper, we propose an extension of the encoder-decoder framework by adding a component called the guiding network. The guiding network models the attribute properties of input images, and its output is leveraged to compose the input of the decoder at ...
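A hedged sketch of the guiding idea, under the assumption that the guiding network maps image features to a vector that is concatenated to the decoder input at every step (names below are hypothetical):

import torch
import torch.nn as nn

class GuidedDecoderInput(nn.Module):
    """Illustrative guiding network: appends a guide vector to each step's input."""
    def __init__(self, d_img: int, d_guide: int):
        super().__init__()
        self.guide = nn.Sequential(nn.Linear(d_img, d_guide), nn.ReLU())

    def forward(self, img_feat, word_emb):
        # img_feat: (B, d_img) global image feature; word_emb: (B, T, d_word)
        g = self.guide(img_feat).unsqueeze(1)       # (B, 1, d_guide)
        g = g.expand(-1, word_emb.size(1), -1)      # repeat over time steps
        return torch.cat([word_emb, g], dim=-1)     # per-step decoder input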
Learning to Evaluate Image Captioning
Evaluation metrics for image captioning face two challenges. Firstly, commonly used metrics such as CIDEr, METEOR, ROUGE and BLEU often do not correlate well with human judgments. Secondly, each metric has well-known blind spots to pathological caption constructions, and rule-based metrics lack provisions to repair such blind spots once identified. For example, the newly proposed SPICE correlate...
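To make the blind-spot point concrete, the toy example below (assuming nltk is installed) shows unigram BLEU giving a scrambled caption the same perfect score as the fluent one, because unigram precision ignores word order entirely:

from nltk.translate.bleu_score import sentence_bleu

ref = 'a man rides a horse on the beach'.split()
fluent = 'a man rides a horse on the beach'.split()
scrambled = 'beach the on horse a rides man a'.split()

# Unigram BLEU (weights=(1,)): both captions contain exactly the reference's
# words, so both score 1.0, even though the second is unreadable.
print(sentence_bleu([ref], fluent, weights=(1,)))     # 1.0
print(sentence_bleu([ref], scrambled, weights=(1,)))  # 1.0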
Contrastive Learning for Image Captioning
Image captioning, a popular topic in computer vision, has achieved substantial progress in recent years. However, the distinctiveness of natural descriptions is often overlooked in previous work. It is closely related to the quality of captions, as distinctive captions are more likely to describe images with their unique aspects. In this work, we propose a new learning method, Contrastive Learn...
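One generic way to express distinctiveness (a hedged sketch, not the paper's exact objective) is a margin loss that pushes the model to score a caption higher for its own image than for a mismatched one:

import torch.nn.functional as F

def contrastive_caption_loss(logp_pos, logp_neg, margin: float = 1.0):
    # logp_pos: (B,) log-likelihood of each caption given its matching image
    # logp_neg: (B,) log-likelihood of the same caption given a mismatched image
    # Penalize whenever the mismatched pair comes within `margin` of the match.
    return F.relu(margin - (logp_pos - logp_neg)).mean()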
Image Captioning using Visual Attention
This project aims at generating captions for images using neural language models. There has been a substantial increase in the number of proposed models for the image captioning task since neural language models and convolutional neural networks (CNNs) became popular. Our project builds on one such work, which uses a variant of a recurrent neural network coupled with a CNN. We intend to enhance t...
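A minimal sketch of the soft visual attention component such a model would use (in the spirit of additive attention over CNN grid features; dimensions and names are our assumptions):

import torch
import torch.nn as nn

class SoftVisualAttention(nn.Module):
    """Illustrative additive attention over CNN grid features."""
    def __init__(self, d_feat: int, d_hidden: int, d_attn: int = 256):
        super().__init__()
        self.wf = nn.Linear(d_feat, d_attn)
        self.wh = nn.Linear(d_hidden, d_attn)
        self.v = nn.Linear(d_attn, 1)

    def forward(self, feats, h):
        # feats: (B, R, d_feat) CNN grid regions; h: (B, d_hidden) RNN state
        e = self.v(torch.tanh(self.wf(feats) + self.wh(h).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)      # (B, R, 1) attention weights
        ctx = (alpha * feats).sum(dim=1)     # (B, d_feat) attended context
        return ctx, alpha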
Journal
Journal title: International Journal of Computer Vision
Year: 2022
ISSN: 0920-5691, 1573-1405
DOI: https://doi.org/10.1007/s11263-022-01692-8